Ford GoBike System Data

Introduction

Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles for short trips, typically 30 minutes or less. Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used. In this project, an exploratory analysis will be done over Ford GoBike data, a bike-share system provider.

Preliminary Wrangling

  • Source data : https://www.fordgobike.com/system-data
  • It includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area.
  • Multiple data files will need to be joined together in case a full year’s coverage is needed. In this project, we will focus on the record of individual trips taken in from June, 2018 to May, 2019.

  • The features included in this dataset : Trip Duration (seconds) , Start Time and Date , End Time and Date , Start Station ID , Start Station Name , Start Station Latitude , Start Station Longitude , End Station ID , End Station Name, End Station Latitude , End Station Longitude , Bike ID , User Type (Subscriber or Customer – “Subscriber” = Member or “Customer” = Casual) , Member Year of Birth, Member Gender

In [1]:
# import all packages and set plots to be embedded inline
import os
import time
import glob
import numpy as np
import pandas as pd
import helpers as hp
import plotly.express as px
import plotly.graph_objects as go

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [2]:
# Reading all csv files at once and append them at one dataframe
path = r"D:\GitLabRepos\GoBike\gobike"
all_files = glob.glob(os.path.join(path, "*.csv"))
print("Concatincating files to one file...")
start_time = time.time()
df = pd.concat(
    (pd.read_csv(
        file, parse_dates=['start_time', 'end_time', 'member_birth_year'],
        dtype={"start_station_id":"O", "end_station_id":"O", "bike_id":"O"},
        nrows=10000
    ) for file in all_files), ignore_index=True
)
end_time = time.time()
print("done!")
print("It tooks {} seconds to read and concatnate datasets".format(round(end_time - start_time, 2)))
Concatincating files to one file...
done!
It tooks 0.57 seconds to read and concatnate datasets
In [3]:
df.head()
Out[3]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 59088 2018-06-30 23:32:44.659 2018-07-01 15:57:33.316 76 McCoppin St at Valencia St 37.771662 -122.422423 95 Sanchez St at 15th St 37.766219 -122.431060 2100 Subscriber 1975-01-01 Male Yes
1 60358 2018-06-30 21:48:19.557 2018-07-01 14:34:18.100 248 Telegraph Ave at Ashby Ave 37.855956 -122.259795 239 Bancroft Way at Telegraph Ave 37.868813 -122.258764 653 Customer NaT NaN No
2 63654 2018-06-30 20:26:53.031 2018-07-01 14:07:47.232 23 The Embarcadero at Steuart St 37.791464 -122.391034 50 2nd St at Townsend St 37.780526 -122.390288 3235 Subscriber 1962-01-01 Female No
3 50508 2018-06-30 20:29:59.813 2018-07-01 10:31:48.250 58 Market St at 10th St 37.776619 -122.417385 88 11th St at Bryant St 37.770030 -122.411726 3675 Subscriber 1992-01-01 Male No
4 51697 2018-06-30 18:24:56.696 2018-07-01 08:46:33.739 196 Grand Ave at Perkins St 37.808894 -122.256460 272 Woolsey St at Sacramento St 37.850578 -122.278175 3232 Customer 1989-01-01 Female No
In [4]:
# exploring NaNs
hp.explore_nans(df, "Exploring NaNs", chart_image=False)
In [5]:
# do some data cleaning : drop nan , remove some insignificant features, adding trip duration in minutes feature 
df.drop(['start_station_latitude','start_station_longitude', 'end_station_latitude', 'end_station_longitude'], axis=1, inplace=True)
df.dropna(inplace=True)
In [6]:
df.shape
Out[6]:
(113327, 12)

What is the structure of your dataset?

It has a shape of 12 columns and 113327 rows after concatinating the 12-months data files available

What is/are the main feature(s) of interest in your dataset?

  • What is the average trip duration?
  • Is season a factor to affect on trip duration?
  • Is season vs months a factor of interest to affect on trip duration?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

'duration_sec', 'start_time', 'end_time', 'user_type', 'member_gender'

Univariate Exploration

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

Question 1 : What is the average trip duration?

In [7]:
# Let's plot the distribution of trip duration.  
data = go.Histogram(x=df["duration_sec"])
layout = go.Layout(
    title="distribution of trip duration", xaxis={"showgrid":False}, yaxis= {"showgrid":False}
)
fig = go.Figure(data, layout)
fig.show()
In [8]:
data = go.Box(y=df["duration_sec"], name="Trip Durations in seconds")
layout = go.Layout(
    title="distribution of trip duration", xaxis={"showgrid":False}, yaxis= {"showgrid":False}
)
fig = go.Figure(data, layout)
fig.show()
  • From the two charts above, I can see a right skewed with a long tail on the right meaning that the data has a low variance.
  • Although these distribution charts showed us the shape of the data, they're still not enough to interpret and they need more percise look. A way to enhance this is to use log transformation which will be done on the next cell.
  • Measuring trip using seconds is not reasonable. I would change the duration_sec to duration_min before I add any log transformation.
In [9]:
# adding new columns to answer the question more precisely
df["duration_min"] = df["duration_sec"] / 60
df['duration_min_log'] = np.log10(df['duration_min'])
In [10]:
# Let's plot the distribution of trip duration.  
data = go.Histogram(x=df["duration_min_log"])
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid":False, "title":"Duration in min"},
    yaxis= {"showgrid":False, "title":"Frequency"}
)
fig = go.Figure(data, layout)
fig.show()
In [11]:
data = go.Box(y=df["duration_min_log"], name="Trip Duration")
layout = go.Layout(
    title="Distribution of trip duration after log transformation", xaxis={"showgrid":False}, yaxis= {"showgrid":False}
)
fig = go.Figure(data, layout)
fig.show()

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

As seen before, it is difficult to read the plot in trip duration per second so that I tend to perform log transformation based on base 10 to plot a normally distributed shape and answer the question precisely. It looks like that most of the trips takes 10 minutes in average - short trips.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, I created a new variable called duration_min_log that holds the trip duration in minutes after a log transformation has been done.

Bivariate Exploration

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

Question 2. Is season a factor to affect on trip duration?

Answering this question requires creating new columns from the start_time column; start_month, start_day, start_hour and summer column for heatmap plot.

In [12]:
# create the columns
df['start_month'], df['start_day'], df['start_hour'] = (
    df['start_time'].dt.month_name(),
    df['start_time'].dt.day_name(),
    df['start_time'].dt.hour,
)
df["season"] = df.apply(hp.seasons, axis=1)  
df.head()
Out[12]:
duration_sec start_time end_time start_station_id start_station_name end_station_id end_station_name bike_id user_type member_birth_year member_gender bike_share_for_all_trip duration_min duration_min_log start_month start_day start_hour season
0 59088 2018-06-30 23:32:44.659 2018-07-01 15:57:33.316 76 McCoppin St at Valencia St 95 Sanchez St at 15th St 2100 Subscriber 1975-01-01 Male Yes 984.800000 2.993348 June Saturday 23 Summer
2 63654 2018-06-30 20:26:53.031 2018-07-01 14:07:47.232 23 The Embarcadero at Steuart St 50 2nd St at Townsend St 3235 Subscriber 1962-01-01 Female No 1060.900000 3.025674 June Saturday 20 Summer
3 50508 2018-06-30 20:29:59.813 2018-07-01 10:31:48.250 58 Market St at 10th St 88 11th St at Bryant St 3675 Subscriber 1992-01-01 Male No 841.800000 2.925209 June Saturday 20 Summer
4 51697 2018-06-30 18:24:56.696 2018-07-01 08:46:33.739 196 Grand Ave at Perkins St 272 Woolsey St at Sacramento St 3232 Customer 1989-01-01 Female No 861.616667 2.935314 June Saturday 18 Summer
7 7224 2018-06-30 23:01:00.400 2018-07-01 01:01:24.433 284 Yerba Buena Center for the Arts (Howard St at ... 284 Yerba Buena Center for the Arts (Howard St at ... 779 Subscriber 1989-01-01 Male No 120.400000 2.080626 June Saturday 23 Summer
In [13]:
# seasons vs median duration trips
season_duration_mean = df.groupby('season')['duration_min'].median().reset_index()
fig = go.Figure(
    go.Bar(
        x=season_duration_mean['season'].tolist(),
        y=season_duration_mean['duration_min'].tolist(),
        text=round(season_duration_mean['duration_min'], 2).astype(str).tolist(),
        textposition="auto"
    ),
    go.Layout(
        title="Average of duration trip per season in minutes",
        xaxis={"showgrid":False, "title":"Season"},
        yaxis={"showgrid":False, "title":"Duration Trip Mean"}
    )
)
fig.show()

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Due to outliers that exit heavily in this data, I chose to measure the average by median not mean to not mislead the results. Despite there is no significant difference in trip dutaion across seasons, the plot appears that spring has the longest median of trip duration. This was expected for me as in spring we have a very relxing whether experience that motivates going bicycling.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Actually, whether doesn't affect that much in SF. I don't know why but this might go back to unchanging extreme whether conditions.

Multivariate Exploration

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

Question 3. Is season vs months/days a factor of interest to affect on trip duration?

In [14]:
fig = go.Figure(
    go.Heatmap(
        z=df['duration_min_log'].tolist(),
        x=df['start_month'].tolist(),
        y=df['season'].tolist(),
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and months across year",
        xaxis={"showgrid":False, "title":"Months"},
        yaxis={"showgrid":False, "title":"Trip Duration in Mintues"},
        xaxis_type="category"
    )
)
fig.show()
In [15]:
fig = go.Figure(
    go.Heatmap(
        z=df['duration_min_log'].tolist(),
        x=df['start_day'].tolist(),
        y=df['season'].tolist(),
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and days across week",
        xaxis={"showgrid":False, "title":"Months"},
        yaxis={"showgrid":False, "title":"Trip Duration in Mintues"},
        xaxis_type="category"
    )
)
fig.show()

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I've created season column that to plot multivariate exploration between season, months and trip duration. From the heatmap above, we can see that longest trip durations are in the summer specifically in August. Winter, in Sep and Jan, comes in the second place, while spring months come with shortest trip durations across the year. Unlike the previous bar chart that shows the longest median duratoin trip is in spring, the heatmap suggests the summer as duration count with longest ones. The second heatmap also prove the same fact that the summer has the peak duration of trips with high frequency in Wednesdays

Were there any interesting or surprising interactions between features?

An interesting point is that useres take long duration trips in the summer although it's hot.

In [ ]: